D² Brier Score (d2_brier_score)#
The Brier score measures how close your predicted probabilities are to the true 0/1 outcomes. The D² Brier score turns that into an explained fraction (similar to \(R^2\)):
\(D^2 = 1\) is perfect probability predictions
\(D^2 = 0\) is a constant “predict the base rate” baseline
\(D^2 < 0\) means worse than the baseline
This notebook covers:
intuition + plots (Plotly)
the math (LaTeX)
a from-scratch NumPy implementation
using the metric as an optimization objective (logistic regression trained by minimizing Brier loss)
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss, r2_score
from sklearn.model_selection import train_test_split
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) Problem setting#
We focus on binary classification.
True labels: \(y_i \in \{0, 1\}\)
Predicted probability for the positive class: \(p_i \in [0, 1]\), where \(p_i\) is the model’s estimate of \(P(y_i = 1)\)
The key is that we evaluate probabilities, not hard class labels.
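A quick toy computation (hypothetical numbers) shows why honest probabilities beat hard labels under this kind of evaluation:
import numpy as np

y = np.array([1, 1, 1, 0])         # four similar examples, positive rate 0.75
p_soft = np.full(4, 0.75)          # honest probability estimate
p_hard = np.ones(4)                # hard label "1" for everyone
print(np.mean((p_soft - y) ** 2))  # 0.1875
print(np.mean((p_hard - y) ** 2))  # 0.2500 (worse, despite identical hard predictions)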
2) Brier score (the underlying loss)#
For binary outcomes, the Brier score is just mean squared error in probability space:
\[
\mathrm{BS} = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)^2
\]
With non-negative sample weights \(w_i\):
\[
\mathrm{BS} = \frac{\sum_{i=1}^{n} w_i \, (p_i - y_i)^2}{\sum_{i=1}^{n} w_i}
\]
Interpretation: each example contributes a squared error term.
Predict \(p_i \approx 1\) when \(y_i = 1\) and \(p_i \approx 0\) when \(y_i = 0\) → small loss
Be very confident and wrong (e.g. \(p_i \approx 1\) but \(y_i = 0\)) → loss close to 1
The Brier score is a strictly proper scoring rule: its expectation is minimized, uniquely, by predicting the true probability.
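A one-line check of that claim: if the true probability of \(Y = 1\) is \(q\), the expected loss of predicting \(p\) is
\[
\mathbb{E}\big[(p - Y)^2\big] = q\,(1 - p)^2 + (1 - q)\,p^2,
\]
whose derivative in \(p\) is \(2(p - q)\), so it is uniquely minimized at \(p = q\).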
p = np.linspace(0, 1, 401)
fig = go.Figure()
fig.add_trace(go.Scatter(x=p, y=(p - 0.0) ** 2, mode="lines", name="y=0"))
fig.add_trace(go.Scatter(x=p, y=(p - 1.0) ** 2, mode="lines", name="y=1"))
fig.update_layout(
title="Brier loss for a single example: (p - y)^2",
xaxis_title="Predicted probability p",
yaxis_title="Squared error",
)
fig.update_xaxes(range=[0, 1])
fig.update_yaxes(range=[0, 1])
fig.show()
3) D² Brier score (fraction of Brier explained)#
The D² Brier score compares your model to a simple reference predictor:
Reference probability: the empirical base rate \(\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i\)
Reference predictions: \(p_i^{\text{ref}} = \bar{y}\) for all \(i\)
The reference Brier score is the Brier score of this constant predictor:
\[
\mathrm{BS}_{\text{ref}} = \frac{1}{n} \sum_{i=1}^{n} (\bar{y} - y_i)^2
\]
For binary \(y \in \{0,1\}\) we have \(y_i^2 = y_i\), so this is just the variance of \(y\) and simplifies to:
\[
\mathrm{BS}_{\text{ref}} = \bar{y}\,(1 - \bar{y})
\]
The D² Brier score is then:
\[
D^2 = 1 - \frac{\mathrm{BS}}{\mathrm{BS}_{\text{ref}}}
\]
How to read it
\(D^2=1\): perfect probabilities (Brier score is 0)
\(D^2=0\): no improvement over predicting the base rate
\(D^2<0\): worse than the base-rate predictor
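A quick worked example: with base rate \(\bar{y} = 0.3\), the reference score is \(\mathrm{BS}_{\text{ref}} = 0.3 \times 0.7 = 0.21\). A model achieving \(\mathrm{BS} = 0.15\) then gets \(D^2 = 1 - 0.15/0.21 \approx 0.29\), i.e. it explains about 29% of the baseline Brier score.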
Edge case: if all labels are the same, then \(\mathrm{BS}_{\text{ref}} = 0\) and \(D^2\) is undefined.
A common convention (used by sklearn.metrics.r2_score with force_finite=True) is:
return 1.0 if your predictions are perfect
otherwise return 0.0
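A quick check of this convention (force_finite=True is the default):
from sklearn.metrics import r2_score

y_const = [1, 1, 1, 1]
print(r2_score(y_const, [1.0, 1.0, 1.0, 1.0]))  # 1.0: perfect predictions
print(r2_score(y_const, [0.9, 1.0, 1.0, 1.0]))  # 0.0: imperfect, zero denominator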
4) Connection to \(R^2\)#
For binary targets, the D² Brier score is exactly the coefficient of determination computed on \((y, p)\):
\[
D^2 = 1 - \frac{\sum_{i} (y_i - p_i)^2}{\sum_{i} (y_i - \bar{y})^2}
\]
That is the same algebraic form as \(R^2\).
This is useful as a sanity check:
from sklearn.metrics import r2_score
d2_brier = r2_score(y_true, y_prob)
At the time of writing, scikit-learn exposes brier_score_loss but does not ship a d2_brier_score convenience function.
You can compute it directly from Brier scores with the base-rate reference:
import numpy as np
from sklearn.metrics import brier_score_loss
p_ref = y_true.mean()
d2_brier = 1 - brier_score_loss(y_true, y_prob) / brier_score_loss(
y_true,
np.full_like(y_true, p_ref, dtype=float),
)
Conceptually, though, D² Brier is a metric for probabilistic classification, and it is usually discussed together with calibration diagnostics (reliability diagrams).
def brier_score_loss_np(y_true, y_prob, sample_weight=None):
y_true = np.asarray(y_true, dtype=float)
y_prob = np.asarray(y_prob, dtype=float)
err2 = (y_prob - y_true) ** 2
if sample_weight is None:
return float(np.mean(err2))
w = np.asarray(sample_weight, dtype=float)
return float(np.average(err2, weights=w))
def d2_brier_score_np(y_true, y_prob, sample_weight=None):
y_true = np.asarray(y_true, dtype=float)
y_prob = np.asarray(y_prob, dtype=float)
if sample_weight is None:
p_ref = float(np.mean(y_true))
else:
p_ref = float(np.average(y_true, weights=sample_weight))
y_ref = np.full_like(y_true, p_ref, dtype=float)
bs = brier_score_loss_np(y_true, y_prob, sample_weight=sample_weight)
bs_ref = brier_score_loss_np(y_true, y_ref, sample_weight=sample_weight)
if np.isclose(bs_ref, 0.0):
return 1.0 if np.isclose(bs, 0.0) else 0.0
return 1.0 - bs / bs_ref
def plot_reliability_diagram(y_true, probas_by_name, n_bins=12, title="Reliability diagram"):
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=[0, 1],
y=[0, 1],
mode="lines",
name="Perfect calibration",
line=dict(dash="dash"),
)
)
for name, p in probas_by_name.items():
prob_true, prob_pred = calibration_curve(
y_true, p, n_bins=n_bins, strategy="uniform"
)
fig.add_trace(
go.Scatter(
x=prob_pred,
y=prob_true,
mode="lines+markers",
name=name,
)
)
fig.update_layout(
title=title,
xaxis_title="Mean predicted probability",
yaxis_title="Fraction of positives",
legend_title="Model",
)
fig.update_xaxes(range=[0, 1])
fig.update_yaxes(range=[0, 1], scaleanchor="x", scaleratio=1)
return fig
# Quick sanity checks
y = rng.integers(0, 2, size=2000)
p = rng.random(size=y.shape[0])
bs_np = brier_score_loss_np(y, p)
bs_sklearn = brier_score_loss(y, p)
d2_np = d2_brier_score_np(y, p)
d2_r2 = r2_score(y, p)
print(f"Brier (NumPy) : {bs_np:.6f}")
print(f"Brier (sklearn): {bs_sklearn:.6f}")
print(f"D² Brier (NumPy): {d2_np:.6f}")
print(f"r2_score : {d2_r2:.6f}")
assert np.allclose(bs_np, bs_sklearn)
assert np.allclose(d2_np, d2_r2)
# Baseline should give D² = 0
p_base = np.full_like(y, y.mean(), dtype=float)
assert np.allclose(d2_brier_score_np(y, p_base), 0.0)
Brier (NumPy) : 0.332926
Brier (sklearn): 0.332926
D² Brier (NumPy): -0.331711
r2_score : -0.331711
5) Intuition: calibrated vs. miscalibrated probabilities#
We’ll generate data from a known probability model so we can create different types of predictors:
Oracle: predicts the true probability (best possible Brier score)
Underconfident: probabilities are pushed toward 0.5
Overconfident: probabilities are pushed toward 0 or 1
Baseline: always predicts the base rate \(\bar{y}\)
Then we compare Brier and D² Brier and visualize calibration.
def sigmoid(z):
z = np.asarray(z)
out = np.empty_like(z, dtype=float)
pos = z >= 0
out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
exp_z = np.exp(z[~pos])
out[~pos] = exp_z / (1.0 + exp_z)
return out
n = 6000
x = rng.normal(size=n)
p_true = sigmoid(1.8 * x - 0.2)
y = rng.binomial(1, p_true)
logit_true = np.log(p_true / (1 - p_true))
p_oracle = p_true
p_under = sigmoid(0.5 * logit_true)
p_over = sigmoid(2.0 * logit_true)
p_base = np.full_like(y, y.mean(), dtype=float)
models = {
"baseline (̅y)": p_base,
"underconfident": p_under,
"oracle": p_oracle,
"overconfident": p_over,
}
names = list(models.keys())
briers = np.array([brier_score_loss_np(y, models[name]) for name in names])
d2s = np.array([d2_brier_score_np(y, models[name]) for name in names])
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=("Brier score (lower is better)", "D² Brier score (higher is better)"),
)
fig.add_trace(go.Bar(x=names, y=briers), row=1, col=1)
fig.add_trace(go.Bar(x=names, y=d2s), row=1, col=2)
fig.update_layout(title="Same data, different probability quality", showlegend=False)
fig.update_yaxes(title_text="Brier", row=1, col=1)
fig.update_yaxes(title_text="D²", row=1, col=2)
fig.show()
fig = plot_reliability_diagram(
y_true=y,
probas_by_name=models,
title="Reliability diagram: calibration differences show up in Brier/D²",
)
fig.show()
6) Per-example contributions (what drives the score?)#
The Brier score is an average of \((p_i - y_i)^2\). Looking at the worst individual contributions helps you see why a model is doing poorly.
Below we visualize the biggest squared errors for the overconfident predictor.
contrib = (p_over - y) ** 2
idx = np.argsort(contrib)[::-1][:60]
colors = np.where(y[idx] == 1, "#d62728", "#1f77b4")
hover = [f"p={p_over[i]:.3f}<br>y={int(y[i])}<br>(p-y)^2={contrib[i]:.3f}" for i in idx]
fig = go.Figure(
data=[
go.Bar(
x=np.arange(len(idx)),
y=contrib[idx],
marker_color=colors,
hovertext=hover,
hoverinfo="text",
)
]
)
fig.update_layout(
title="Largest Brier contributions (overconfident predictor)",
xaxis_title="Worst examples (rank)",
yaxis_title="(p - y)^2",
)
fig.show()
7) Using D² Brier for optimization (logistic regression)#
On a fixed dataset, the reference Brier score \(\mathrm{BS}_{\text{ref}}\) depends only on \(y\). So maximizing D² Brier is equivalent to minimizing the Brier score:
\[
\arg\max_w \; D^2(w) = \arg\max_w \left(1 - \frac{\mathrm{BS}(w)}{\mathrm{BS}_{\text{ref}}}\right) = \arg\min_w \; \mathrm{BS}(w)
\]
We’ll fit a simple logistic regression model:
\[
p_i = \sigma(x_i^\top w), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
\]
Using the Brier score as the objective:
\[
L(w) = \frac{1}{n} \sum_{i=1}^{n} \left( \sigma(x_i^\top w) - y_i \right)^2
\]
The gradient follows from the chain rule with \(\sigma'(z) = \sigma(z)\,(1 - \sigma(z))\) and is, vectorized:
\[
\nabla_w L = \frac{2}{n} \, X^\top \Big[ (p - y) \odot p \odot (1 - p) \Big]
\]
where \(p = \sigma(Xw)\) and \(\odot\) is elementwise multiplication.
We’ll also train the same model with log loss for comparison.
X, y = make_classification(
n_samples=3500,
n_features=2,
n_redundant=0,
n_informative=2,
n_clusters_per_class=1,
class_sep=1.2,
flip_y=0.03,
random_state=42,
)
X_train, X_val, y_train, y_val = train_test_split(
X, y, test_size=0.30, random_state=42, stratify=y
)
# Standardize for nicer optimization
mean = X_train.mean(axis=0)
std = X_train.std(axis=0) + 1e-12
X_train = (X_train - mean) / std
X_val = (X_val - mean) / std
def add_intercept(X):
return np.c_[np.ones((X.shape[0], 1)), X]
X_train_i = add_intercept(X_train)
X_val_i = add_intercept(X_val)
X_train_i.shape, X_val_i.shape
((2450, 3), (1050, 3))
def fit_logistic_gd(
X_train,
y_train,
X_val,
y_val,
*,
loss="brier",
lr=0.5,
n_epochs=800,
l2=0.0,
seed=0,
):
rng_local = np.random.default_rng(seed)
w = rng_local.normal(scale=0.1, size=X_train.shape[1])
history = {
"brier_train": np.empty(n_epochs),
"d2_train": np.empty(n_epochs),
"brier_val": np.empty(n_epochs),
"d2_val": np.empty(n_epochs),
}
for epoch in range(n_epochs):
# Forward
p_train = sigmoid(X_train @ w)
p_val = sigmoid(X_val @ w)
# Metrics
history["brier_train"][epoch] = brier_score_loss_np(y_train, p_train)
history["d2_train"][epoch] = d2_brier_score_np(y_train, p_train)
history["brier_val"][epoch] = brier_score_loss_np(y_val, p_val)
history["d2_val"][epoch] = d2_brier_score_np(y_val, p_val)
# Gradient
if loss == "brier":
grad = (2.0 / X_train.shape[0]) * X_train.T @ (
(p_train - y_train) * p_train * (1.0 - p_train)
)
elif loss == "log":
# Negative log-likelihood / cross-entropy
grad = (1.0 / X_train.shape[0]) * X_train.T @ (p_train - y_train)
else:
raise ValueError("loss must be 'brier' or 'log'")
# L2 regularization (skip intercept)
if l2:
reg = np.r_[0.0, w[1:]]
grad = grad + l2 * reg
# Update
w = w - lr * grad
return w, history
w_brier, hist_brier = fit_logistic_gd(
X_train_i,
y_train,
X_val_i,
y_val,
loss="brier",
lr=0.8,
n_epochs=800,
l2=1e-3,
seed=1,
)
w_log, hist_log = fit_logistic_gd(
X_train_i,
y_train,
X_val_i,
y_val,
loss="log",
lr=0.3,
n_epochs=800,
l2=1e-3,
seed=1,
)
w_brier, w_log
(array([0.5236, 3.7584, 1.0648]), array([0.5545, 4.6785, 1.1888]))
epochs = np.arange(len(hist_brier["brier_train"]))
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=("Validation Brier score", "Validation D² Brier"),
)
fig.add_trace(
go.Scatter(x=epochs, y=hist_brier["brier_val"], mode="lines", name="GD (Brier objective)"),
row=1,
col=1,
)
fig.add_trace(
go.Scatter(x=epochs, y=hist_log["brier_val"], mode="lines", name="GD (Log-loss objective)"),
row=1,
col=1,
)
fig.add_trace(
go.Scatter(x=epochs, y=hist_brier["d2_val"], mode="lines", name="GD (Brier objective)"),
row=1,
col=2,
)
fig.add_trace(
go.Scatter(x=epochs, y=hist_log["d2_val"], mode="lines", name="GD (Log-loss objective)"),
row=1,
col=2,
)
fig.update_layout(title="Training dynamics (validation)")
fig.update_yaxes(title_text="Brier", row=1, col=1)
fig.update_yaxes(title_text="D²", row=1, col=2)
fig.show()
p_val_brier = sigmoid(X_val_i @ w_brier)
p_val_log = sigmoid(X_val_i @ w_log)
p_val_base = np.full_like(y_val, y_train.mean(), dtype=float)
print("Validation metrics")
print("-" * 60)
for name, p_val in {
"baseline (train ̅y)": p_val_base,
"logistic GD (Brier objective)": p_val_brier,
"logistic GD (Log-loss objective)": p_val_log,
}.items():
bs = brier_score_loss_np(y_val, p_val)
d2 = d2_brier_score_np(y_val, p_val)
print(f"{name:30s} Brier={bs:.4f} D²={d2:.4f}")
fig = plot_reliability_diagram(
y_true=y_val,
probas_by_name={
"baseline": p_val_base,
"GD (Brier objective)": p_val_brier,
"GD (Log loss objective)": p_val_log,
},
title="Calibration on validation set",
)
fig.show()
Validation metrics
------------------------------------------------------------
baseline (train ȳ)            Brier=0.2500 D²=-0.0000
logistic GD (Brier objective) Brier=0.0410 D²=0.8358
logistic GD (Log-loss objective) Brier=0.0385 D²=0.8459
8) Pros, cons, and where to use it#
Pros
Evaluates probabilities directly (not just class labels)
Sensitive to calibration (reliability) and accuracy
Proper scoring rule: incentivizes honest probabilities
D² gives a normalized scale: “fraction of Brier explained” relative to the base-rate predictor
Cons / pitfalls
Needs well-defined probabilities; feeding hard labels (0/1) throws away information
If the dataset is extremely imbalanced, \(\bar{y}(1-\bar{y})\) can be very small → D² can become unstable
If all labels are identical, the score is undefined (denominator 0)
As a training objective with a sigmoid model, the Brier loss has an extra \(p(1-p)\) factor in its gradient, which can shrink updates when \(p\) saturates near 0 or 1, compared to log loss (see the small demo after this list)
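A small numeric demo of that last point, reusing the numerically stable sigmoid defined in section 5 (all examples assume \(y = 0\)):
p = sigmoid(np.array([0.0, 2.0, 6.0]))  # increasingly confident, and wrong since y = 0
print(p)                                # ~[0.5, 0.8808, 0.9975]
print(p - 0.0)                          # per-example log-loss gradient factor: grows toward 1
print(2 * (p - 0.0) * p * (1 - p))      # per-example Brier gradient factor: shrinks as p saturates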
Good use cases
Risk prediction / decision support where probability calibration matters (medicine, finance, fraud)
Comparing probabilistic models against a simple baseline
Monitoring probabilistic models over time (drift can show up as calibration degradation)
Diagnostics to pair with it
Reliability diagram (calibration curve)
Histogram of predicted probabilities (do you predict only around 0.5? see the sketch after this list)
If you also care about ranking/discrimination, add AUC/PR-AUC alongside Brier/D²
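For instance, a quick histogram of the validation probabilities from section 7 (a sketch reusing p_val_log and the px import from the top of the notebook):
fig = px.histogram(x=p_val_log, nbins=30, title="Distribution of predicted probabilities")
fig.update_layout(xaxis_title="Predicted probability", yaxis_title="Count")
fig.show()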
9) Exercises#
Create a model that ranks perfectly but is badly calibrated (e.g. apply an aggressive temperature to logits). What happens to AUC vs. Brier/D²?
Implement sample weights and verify that your weighted D² matches sklearn.metrics.r2_score(..., sample_weight=...).
Try Platt scaling or isotonic regression to calibrate a model; measure the change in Brier and D².
References#
Brier score (Wikipedia): https://en.wikipedia.org/wiki/Brier_score
scikit-learn
brier_score_loss: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.brier_score_loss.html
r2_score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html